Mean Gamma Deviance (mean_gamma_deviance) — Regression Metric (From Scratch)#
Mean Gamma deviance is a scale-invariant regression loss for strictly positive targets. It measures error through the ratio between the true value and the prediction, making it a strong fit when relative errors matter (and variance tends to grow with the square of the mean).
Goals
Build intuition with small numeric examples + Plotly visuals
Derive the formula + key properties (non-negativity, scale invariance, asymmetry)
Implement mean_gamma_deviance in NumPy (from scratch) and validate vs scikit-learn
Use the loss to fit a simple Gamma regression (GLM) with gradient descent (low-level NumPy)
Summarize pros/cons, good use cases, and common pitfalls
Quick import#
from sklearn.metrics import mean_gamma_deviance
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.linear_model import GammaRegressor
from sklearn.metrics import mean_gamma_deviance
from sklearn.model_selection import train_test_split
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(7)
np.set_printoptions(precision=4, suppress=True)
import plotly
import sklearn
print("numpy :", np.__version__)
print("pandas :", pd.__version__)
print("sklearn:", sklearn.__version__)
print("plotly :", plotly.__version__)
numpy : 1.26.2
pandas : 2.1.3
sklearn: 1.6.0
plotly : 6.5.2
Prerequisites#
Regression setup: targets \(y\) and predictions \(\hat y\)
Comfort with logs and ratios
(Optional) basic derivatives for the gradient / optimization section
Domain constraints
Requires strictly positive values: \(y_i > 0\) and \(\hat y_i > 0\).
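As a quick demonstration, scikit-learn raises on nonpositive inputs (the exact error message may vary across versions):
import numpy as np
from sklearn.metrics import mean_gamma_deviance

try:
    mean_gamma_deviance(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
except ValueError as err:
    print(type(err).__name__, ":", err)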
1) Definition and notation#
Let:
\(y \in \mathbb{R}_{>0}^n\) be the true targets
\(\hat y \in \mathbb{R}_{>0}^n\) be the predictions
The Gamma deviance per sample is:
\[d_i = 2\left(\log\frac{\hat y_i}{y_i} + \frac{y_i}{\hat y_i} - 1\right)\]
The mean Gamma deviance is:
\[\mathrm{MGD}(y, \hat y) = \frac{1}{n}\sum_{i=1}^{n} d_i\]
In scikit-learn, mean_gamma_deviance is the Tweedie deviance with power \(p=2\).
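That equivalence is easy to cross-check with mean_tweedie_deviance (power values strictly between 1 and 2 give the compound-Poisson deviances mentioned in the pitfalls section):
from sklearn.metrics import mean_tweedie_deviance

y_t = np.array([1.0, 2.0, 3.0])
y_p = np.array([1.5, 1.5, 2.5])
print(mean_gamma_deviance(y_t, y_p))
print(mean_tweedie_deviance(y_t, y_p, power=2))  # identical by definition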
Where this comes from (deviance / likelihood view)#
In generalized linear models (GLMs), a deviance compares your model to a “perfect” saturated model (one with \(\hat y_i = y_i\)):
\[D = 2\left(\log L_{\text{saturated}} - \log L_{\text{model}}\right)\]
For the Gamma family (ignoring dispersion constants), the negative log-likelihood for one sample can be written as:
\[-\ell(\hat y_i;\, y_i) = \log \hat y_i + \frac{y_i}{\hat y_i} + \text{const}(y_i)\]
So, substituting \(\hat y_i = y_i\) for the saturated term, the deviance contribution is:
\[d_i = 2\left[\left(\log \hat y_i + \frac{y_i}{\hat y_i}\right) - \left(\log y_i + 1\right)\right] = 2\left(\log\frac{\hat y_i}{y_i} + \frac{y_i}{\hat y_i} - 1\right)\]
This is exactly the formula used by mean_gamma_deviance.
Ratio form (why it measures relative error)#
Define the ratio
\[r_i = \frac{y_i}{\hat y_i}\]
Then:
\[d_i = 2\left(r_i - \log r_i - 1\right)\]
So the loss depends only on the ratio \(y/\hat y\):
scaling both \(y\) and \(\hat y\) by the same constant leaves \(r\) unchanged → same deviance
a “2× overprediction” (\(\hat y = 2y\)) is penalized the same whether \(y=1\) or \(y=1000\)
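Both bullets are easy to verify numerically; a minimal check:
# Same ratio => same per-sample deviance, at any scale
def d_of(y, yhat):
    return 2 * (np.log(yhat / y) + y / yhat - 1)

print(d_of(3.0, 6.0), d_of(3000.0, 6000.0))   # scaling both by 1000 changes nothing
print(d_of(1.0, 2.0), d_of(1000.0, 2000.0))   # 2x overprediction at y=1 vs y=1000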
Non-negativity (and when it is 0)#
A key inequality for \(r>0\) is:
\[\log r \le r - 1\]
Rearranging:
\[r - \log r - 1 \ge 0\]
Therefore \(d_i \ge 0\) for every sample, and:
\(d_i = 0\) iff \(r_i = 1\) iff \(y_i = \hat y_i\)
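A quick grid check of the inequality and the zero case:
# log r <= r - 1 for all r > 0, with equality (d = 0) only at r = 1
r_grid = np.logspace(-3, 3, 1001)
print(np.all(np.log(r_grid) <= r_grid - 1))        # True on the whole grid
d_grid = 2 * (r_grid - np.log(r_grid) - 1)
print(d_grid.min(), r_grid[np.argmin(d_grid)])     # minimum ~0, attained at r ~ 1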
2) Intuition: shape of the penalty#
Asymmetry#
Because \(d_i\) depends on the ratio \(r=y/\hat y\):
underprediction (\(\hat y < y\)) gives \(r>1\) and grows roughly linearly in \(r\)
overprediction (\(\hat y > y\)) gives \(r<1\) and grows roughly like \(-\log r\)
So big underpredictions are typically penalized more strongly than equally-large overpredictions (in multiplicative terms).
Small-error approximation#
Let the relative error be
\[\varepsilon_i = \frac{y_i - \hat y_i}{\hat y_i}, \qquad r_i = 1 + \varepsilon_i\]
Using \(\log(1+\varepsilon) \approx \varepsilon - \varepsilon^2/2\) for small \(\varepsilon\):
\[d_i = 2\left(\varepsilon_i - \log(1+\varepsilon_i)\right) \approx \varepsilon_i^2\]
So for small errors:
\[d_i \approx \left(\frac{y_i - \hat y_i}{\hat y_i}\right)^2\]
Interpretation: Gamma deviance behaves like squared relative error near the optimum.
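A quick numeric check of this approximation (the plot below then shows the full curve):
# d(1 + eps) vs eps^2 for a few small relative errors
eps = np.array([-0.1, -0.01, 0.01, 0.1])
d_small = 2 * ((1 + eps) - np.log1p(eps) - 1)
print(np.column_stack([eps, d_small, eps**2]))  # middle ~ right column, up to O(eps^3)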
# Visualize d(r) = 2 * (r - log r - 1)
r = np.logspace(-2, 2, 600) # r = y / y_hat
d = 2 * (r - np.log(r) - 1)
fig = px.line(
x=r,
y=d,
log_x=True,
title="Gamma deviance as a function of ratio r = y / ŷ",
labels={"x": "r = y / ŷ (log scale)", "y": "d(r)"},
)
fig.add_vline(x=1.0, line_dash="dash", line_color="black")
fig.update_layout(showlegend=False)
fig.show()
# A few multiplicative errors side-by-side
ratios = np.array([0.1, 0.5, 1.0, 2.0, 10.0]) # r = y / y_hat
contrib = 2 * (ratios - np.log(ratios) - 1)
pd.DataFrame({"r = y/ŷ": ratios, "d(r)": contrib}).style.format({"d(r)": "{:.4f}"})
| | r = y/ŷ | d(r) |
|---|---|---|
| 0 | 0.100000 | 2.8052 |
| 1 | 0.500000 | 0.3863 |
| 2 | 1.000000 | 0.0000 |
| 3 | 2.000000 | 0.6137 |
| 4 | 10.000000 | 13.3948 |
3) A tiny worked example#
We’ll compute mean Gamma deviance step-by-step and compare to scikit-learn.
y_true = np.array([2.0, 0.5, 1.0, 4.0])
y_pred = np.array([0.5, 0.5, 2.0, 2.0])
per_sample = 2 * (np.log(y_pred / y_true) + y_true / y_pred - 1)
df = pd.DataFrame(
{
"y_true": y_true,
"y_pred": y_pred,
"ratio r=y/ŷ": y_true / y_pred,
"per-sample deviance d_i": per_sample,
}
)
mgd_np = float(per_sample.mean())
mgd_sklearn = mean_gamma_deviance(y_true, y_pred)
(df.style.format({"ratio r=y/ŷ": "{:.3f}", "per-sample deviance d_i": "{:.4f}"}), mgd_np, mgd_sklearn)
(<pandas.io.formats.style.Styler at 0x796358e94da0>,
1.0568528194400546,
1.0568528194400546)
4) NumPy implementation (from scratch)#
A minimal implementation is just the formula plus:
shape checks
positivity checks (or optional clipping)
optional sample weights
def mean_gamma_deviance_np(y_true, y_pred, sample_weight=None, *, eps=0.0):
'''Mean Gamma deviance (NumPy).
Matches scikit-learn's definition:
d_i = 2 * (log(y_pred / y_true) + y_true / y_pred - 1)
mean = average(d_i, weights=sample_weight)
Notes
-----
- Requires y_true > 0 and y_pred > 0.
- If eps > 0, values are clipped to [eps, +inf) to avoid log/div-by-zero.
'''
y_true = np.asarray(y_true, dtype=float)
y_pred = np.asarray(y_pred, dtype=float)
if y_true.shape != y_pred.shape:
raise ValueError(f"Shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")
if eps > 0:
y_true = np.clip(y_true, eps, None)
y_pred = np.clip(y_pred, eps, None)
if np.any(y_true <= 0) or np.any(y_pred <= 0):
raise ValueError("mean_gamma_deviance requires strictly positive y_true and y_pred")
dev = 2 * (np.log(y_pred / y_true) + y_true / y_pred - 1)
return float(np.average(dev, weights=sample_weight))
# Validate vs scikit-learn (including sample weights)
y_true = rng.lognormal(mean=0.0, sigma=0.8, size=1_000)
y_pred = rng.lognormal(mean=0.1, sigma=0.8, size=1_000)
weights = rng.uniform(0.5, 2.0, size=1_000)
print("unweighted np :", mean_gamma_deviance_np(y_true, y_pred))
print("unweighted sklearn:", mean_gamma_deviance(y_true, y_pred))
print()
print("weighted np :", mean_gamma_deviance_np(y_true, y_pred, sample_weight=weights))
print("weighted sklearn :", mean_gamma_deviance(y_true, y_pred, sample_weight=weights))
unweighted np : 1.378436722535223
unweighted sklearn: 1.378436722535223
weighted np : 1.3822041665484144
weighted sklearn : 1.3822041665484144
5) Relative vs absolute metrics (scale invariance in one picture)#
Consider a fixed multiplicative error: \(\hat y = c\,y\).
Gamma deviance depends only on \(r=y/\hat y = 1/c\) → constant across scales.
Squared error \((y-\hat y)^2\) grows with the scale of \(y\).
y = np.logspace(0, 3, 80) # 1 .. 1000
c = 1.5 # 50% overprediction
y_hat = c * y
per_sample_gamma = 2 * (np.log(y_hat / y) + y / y_hat - 1)
per_sample_sq = (y - y_hat) ** 2
fig = go.Figure()
fig.add_trace(go.Scatter(x=y, y=per_sample_gamma, mode="lines", name="Gamma deviance (per-sample)"))
fig.add_trace(
go.Scatter(
x=y,
y=per_sample_sq,
mode="lines",
name="Squared error (per-sample)",
yaxis="y2",
)
)
fig.update_layout(
title="Same relative error (ŷ = 1.5·y) across scales",
xaxis=dict(title="y (log scale)", type="log"),
yaxis=dict(title="Gamma deviance (scale-invariant)"),
yaxis2=dict(title="Squared error (scale-dependent)", overlaying="y", side="right", type="log"),
legend=dict(x=0.02, y=0.98),
)
fig.show()
6) Gradients (useful for optimization)#
For a single sample:
\[d_i = 2\left(\log\frac{\hat y_i}{y_i} + \frac{y_i}{\hat y_i} - 1\right)\]
Derivative w.r.t. the prediction \(\hat y_i\):
\[\frac{\partial d_i}{\partial \hat y_i} = 2\left(\frac{1}{\hat y_i} - \frac{y_i}{\hat y_i^2}\right) = 2\,\frac{\hat y_i - y_i}{\hat y_i^2}\]
Log-link parameterization (common in Gamma regression)#
If you model predictions as strictly positive via
\[\mu_i = e^{\eta_i}, \qquad \hat y_i = \mu_i,\]
then by the chain rule:
\[\frac{\partial d_i}{\partial \eta_i} = \frac{\partial d_i}{\partial \mu_i}\,\frac{\partial \mu_i}{\partial \eta_i} = 2\,\frac{\mu_i - y_i}{\mu_i^2}\cdot\mu_i = 2\left(1 - \frac{y_i}{\mu_i}\right)\]
This form is often numerically nicer because it avoids \(1/\mu^2\).
def grad_mean_gamma_deviance_wrt_mu(y_true, mu_pred):
# Gradient of mean Gamma deviance w.r.t. mu_pred (vector).
y_true = np.asarray(y_true, dtype=float)
mu_pred = np.asarray(mu_pred, dtype=float)
if np.any(y_true <= 0) or np.any(mu_pred <= 0):
raise ValueError("y_true and mu_pred must be strictly positive")
n = y_true.size
return (2.0 / n) * (mu_pred - y_true) / (mu_pred**2)
def finite_diff_grad(f, x, eps=1e-6):
x = x.astype(float).copy()
g = np.zeros_like(x)
for i in range(x.size):
x1 = x.copy(); x1[i] += eps
x2 = x.copy(); x2[i] -= eps
g[i] = (f(x1) - f(x2)) / (2 * eps)
return g
y = rng.lognormal(mean=0.0, sigma=0.7, size=7)
mu = rng.lognormal(mean=0.2, sigma=0.7, size=7)
f = lambda mu_vec: mean_gamma_deviance_np(y, mu_vec)
g_analytic = grad_mean_gamma_deviance_wrt_mu(y, mu)
g_numeric = finite_diff_grad(f, mu)
pd.DataFrame(
{"mu": mu, "grad analytic": g_analytic, "grad numeric": g_numeric, "abs diff": np.abs(g_analytic - g_numeric)}
).style.format("{:.3e}")
| | mu | grad analytic | grad numeric | abs diff |
|---|---|---|---|---|
| 0 | 1.477e+00 | 3.719e-02 | 3.719e-02 | 5.883e-11 |
| 1 | 1.181e+00 | 1.660e-01 | 1.660e-01 | 4.342e-11 |
| 2 | 1.026e+00 | -6.787e-02 | -6.787e-02 | 6.520e-11 |
| 3 | 1.491e+00 | 9.188e-02 | 9.188e-02 | 1.543e-11 |
| 4 | 1.860e+00 | 1.309e-01 | 1.309e-01 | 1.947e-11 |
| 5 | 2.803e-01 | 2.875e-01 | 2.875e-01 | 3.567e-11 |
| 6 | 9.893e-01 | 1.937e-01 | 1.937e-01 | 5.989e-12 |
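The log-link gradient can be verified the same way; a small sketch reusing finite_diff_grad and mean_gamma_deviance_np from above:
# Finite-difference check of the eta-gradient: (2/n) * (1 - y/mu), mu = exp(eta)
y_chk = rng.lognormal(mean=0.0, sigma=0.7, size=7)
eta_chk = rng.normal(size=7)
mu_chk = np.exp(eta_chk)
g_eta = (2.0 / y_chk.size) * (1.0 - y_chk / mu_chk)
g_eta_fd = finite_diff_grad(lambda e: mean_gamma_deviance_np(y_chk, np.exp(e)), eta_chk)
print(np.max(np.abs(g_eta - g_eta_fd)))  # should be tiny (finite-difference error only)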
7) Using mean Gamma deviance to optimize a model (Gamma regression)#
A common way to build a model that always predicts positive values is a log link:
\[\eta_i = x_i^\top w, \qquad \mu_i = e^{\eta_i}\]
We fit \(w\) by minimizing mean Gamma deviance:
\[L(w) = \frac{1}{n}\sum_{i=1}^{n} 2\left(\log\frac{\mu_i}{y_i} + \frac{y_i}{\mu_i} - 1\right)\]
Using the gradient w.r.t. \(\eta\):
\[\frac{\partial d_i}{\partial \eta_i} = 2\left(1 - \frac{y_i}{\mu_i}\right)\]
and \(\eta = Xw\), the gradient is:
\[\nabla_w L = \frac{2}{n}\, X^\top \left(\mathbf{1} - \frac{y}{\mu}\right)\]
We’ll implement gradient descent and fit a toy dataset generated from a Gamma model.
# Synthetic Gamma-regression dataset (variance grows ~ mean^2)
n = 600
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X_raw = np.column_stack([x1, x2])
X_raw = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0) # simple standardization
X = np.column_stack([np.ones(n), X_raw]) # add intercept
w_true = np.array([0.25, 0.9, -0.6])
eta_true = X @ w_true
mu_true = np.exp(eta_true)
shape_k = 6.0 # larger k => less noise (var = mu^2 / k)
y = rng.gamma(shape_k, scale=mu_true / shape_k)
(mu_true.min(), mu_true.mean(), mu_true.max(), y.min(), y.mean(), y.max())
(0.03858702982088678,
2.3581594109476653,
52.632742976010974,
0.03697952466385566,
2.2560484806999526,
67.10897917526945)
# Visualize heteroscedasticity (log-log scatter)
line = np.logspace(np.log10(mu_true.min()), np.log10(mu_true.max()), 200)
fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_true, y=y, mode="markers", name="samples", marker=dict(opacity=0.55)))
fig.add_trace(go.Scatter(x=line, y=line, mode="lines", name="y = μ", line=dict(color="black", dash="dash")))
fig.update_layout(
title="Synthetic data: y ~ Gamma(mean=μ) (variance increases with μ)",
xaxis=dict(title="true mean μ", type="log"),
yaxis=dict(title="observed y", type="log"),
)
fig.show()
X_train, X_test, y_train, y_test, mu_train, mu_test = train_test_split(
X, y, mu_true, test_size=0.25, random_state=0
)
def fit_gamma_regression_gd(X, y, *, lr=0.05, n_iter=2500, clip_eta=20.0):
# Fit log-link Gamma regression by minimizing mean Gamma deviance.
n, d = X.shape
w = np.zeros(d)
history = np.zeros(n_iter)
for t in range(n_iter):
eta = np.clip(X @ w, -clip_eta, clip_eta)
mu = np.exp(eta)
history[t] = mean_gamma_deviance_np(y, mu)
# grad = (2/n) * X^T (1 - y/mu)
grad = (2.0 / n) * (X.T @ (1.0 - y / mu))
w -= lr * grad
return w, history
w_gd, history = fit_gamma_regression_gd(X_train, y_train, lr=0.05, n_iter=2500)
print("w_true:", w_true)
print("w_gd :", w_gd)
print()
print("train MGD:", mean_gamma_deviance_np(y_train, np.exp(np.clip(X_train @ w_gd, -20, 20))))
print("test MGD:", mean_gamma_deviance_np(y_test, np.exp(np.clip(X_test @ w_gd, -20, 20))))
w_true: [ 0.25 0.9 -0.6 ]
w_gd : [ 0.2218 0.8851 -0.5517]
train MGD: 0.16124082232077153
test MGD: 0.18385887160506967
fig = px.line(
x=np.arange(len(history)),
y=history,
title="Training mean Gamma deviance during gradient descent",
labels={"x": "iteration", "y": "mean_gamma_deviance"},
)
fig.show()
mu_pred_test = np.exp(np.clip(X_test @ w_gd, -20, 20))
line = np.logspace(np.log10(mu_test.min()), np.log10(mu_test.max()), 200)
fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_test, y=mu_pred_test, mode="markers", name="pred", marker=dict(opacity=0.6)))
fig.add_trace(go.Scatter(x=line, y=line, mode="lines", name="ideal", line=dict(color="black", dash="dash")))
fig.update_layout(
title="Recovered mean μ on test set (gradient descent)",
xaxis=dict(title="true μ", type="log"),
yaxis=dict(title="predicted μ", type="log"),
)
fig.show()
# Compare to scikit-learn's GammaRegressor
# (it includes an intercept internally, so we pass features without the intercept column)
gr = GammaRegressor(alpha=0.0, max_iter=5000)
gr.fit(X_train[:, 1:], y_train)
mu_pred_sk = gr.predict(X_test[:, 1:])
print("sklearn [intercept, coef]:", np.r_[gr.intercept_, gr.coef_])
print("gd w :", w_gd)
print()
print("test MGD (gd) :", mean_gamma_deviance(y_test, mu_pred_test))
print("test MGD (sklearn):", mean_gamma_deviance(y_test, mu_pred_sk))
sklearn [intercept, coef]: [ 0.2218 0.8851 -0.5517]
gd w : [ 0.2218 0.8851 -0.5517]
test MGD (gd) : 0.18385887160506967
test MGD (sklearn): 0.18385841689188462
8) Pros, cons, and when to use mean Gamma deviance#
Pros#
Scale-invariant / relative: depends on \(y/\hat y\), so it treats multiplicative errors consistently across scales.
Well-matched to Gamma-like data (strictly positive, heteroscedastic with \(\mathrm{Var}(y) \propto \mu^2\)).
For log-link linear models (\(\mu = \exp(Xw)\)), the objective is convex in \(w\) (good optimization properties).
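That convexity claim can be probed numerically: along any 1-D slice \(w(t) = w_{\text{gd}} + t\,v\), a convex objective has nonnegative second differences. A minimal sketch, reusing X_train, y_train, and w_gd from section 7:
# Convexity probe along a random direction v in weight space
direction = rng.normal(size=X_train.shape[1])
ts = np.linspace(-1.0, 1.0, 81)
vals = np.array([
    mean_gamma_deviance_np(y_train, np.exp(X_train @ (w_gd + t * direction)))
    for t in ts
])
second_diff = vals[:-2] - 2 * vals[1:-1] + vals[2:]
print("min second difference:", second_diff.min())  # >= 0 for a convex slice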
Cons#
Requires strictly positive \(y\) and \(\hat y\) (zeros/negatives break the log / division).
Can explode when predictions get very close to 0 (huge penalty).
Not in the same units as \(y\) (less directly interpretable than MAE/RMSE).
Asymmetric in multiplicative error: large underpredictions tend to be punished more than equally-large overpredictions.
Good use cases#
Modeling positive continuous quantities where relative error matters: insurance claim severity, costs/prices, rainfall amounts, time-to-complete, demand with strictly positive values.
9) Common pitfalls and diagnostics#
Zeros in the target: Gamma deviance requires \(y>0\).
If you have many zeros + positive continuous values, consider a Tweedie model with \(1<p<2\) (compound Poisson) or a two-part model.
Model outputs can go negative: if you use a plain linear model for \(\hat y\), it can produce invalid (≤0) predictions.
Use a positive link function (e.g., \(\exp(\cdot)\) or softplus) when training with Gamma deviance.
Numerical stability: clip predictions away from 0 (and sometimes clip the linear predictor before applying \(\exp\)); a sketch combining both ideas follows this list.
Check relative residuals: since the loss is relative, inspect \((y-\hat y)/\hat y\) vs \(\hat y\).
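A minimal sketch of the positivity and stability points above (positive_predict is an illustrative helper, not a library function):
# Softplus keeps mu > 0; a floor keeps the deviance finite near 0
def positive_predict(X, w, floor=1e-6):
    eta = X @ w
    mu = np.logaddexp(0.0, eta)      # softplus: log(1 + exp(eta)), numerically stable
    return np.clip(mu, floor, None)  # guard the log / division in the deviance

print(positive_predict(np.array([[1.0, -50.0]]), np.array([1.0, 1.0])))  # clipped up to 1e-06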
10) Exercises#
Implement a weighted gradient descent version (use sample_weight) and verify it matches np.average weighting.
Show empirically that, for small relative errors, Gamma deviance is close to squared relative error.
Compare three approaches on the same positive dataset, and inspect which residual pattern looks most appropriate:
minimize MSE on \(y\)
minimize MSE on \(\log y\) (lognormal-ish)
minimize Gamma deviance (Gamma-ish)
References#
scikit-learn API: sklearn.metrics.mean_gamma_deviance
scikit-learn User Guide: Tweedie deviance (mean_tweedie_deviance)
Generalized Linear Models (GLMs) and Gamma regression (log link)